Automatically Extracting Complex Data Structures from the Web

نویسندگان

Laura Fontán

Rafael López-García

Manuel Álvarez

Alberto Pan

چکیده

This paper presents a new technique for detecting and extracting lists of structured records from Web pages. With respect to most of the state-of-the-art systems, our approach is capable of detecting nested data structures (sublists) and it also incorporates some heuristics to delete unwanted content such as banners and navigation menus from the data region. This article also describes the experiments we have performed to validate the system. The precision and recall we have obtained in our tests surpass 90%.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

ایجاد نیمه خودکار مشاپ های سازمانی با استفاده از توصیفات معنایی

Mashups are next generation of web applications. A mashup is a lightweight web application that is created by combining information or capabilities from more than one existing resources to deliver a new and integrated experience to the user. Mashups introduce a new class of integration techniques in enterprises for implementing situational applications (i.e. applications that come together to s...

متن کامل

ارائه مدلی برای استخراج اطلاعات از مستندات متنی، مبتنی بر متن‌کاوی در حوزه یادگیری الکترونیکی

As computer networks become the backbones of science and economy, enormous quantities documents become available. So, for extracting useful information from textual data, text mining techniques have been used. Text Mining has become an important research area that discoveries unknown information, facts or new hypotheses by automatically extracting information from different written documents. T...

متن کامل

Recognition of logical units in log files

With the development of new technologies more and more information is stored in log files. Analyzing such logs can be very useful for the decision maker. One of the probably best known example is the Web log file analysis where lots of e cient tools have been proposed to extract the top-k accessed pages, the best users or even the patterns describing the behaviors of users on a Web site. These ...

متن کامل

Extracting Data Records from Query Result Pages Based on Visual Features

Web databases contain a large amount of structured data which are accessible via their query interfaces only. Query results are presented in dynamically generated web pages, usually in the form of data records, for human use. The problem of automatically extracting data records from query result pages is critical for web data integration applications, such as comparison shopping sites, meta-sea...

متن کامل

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2012

Automatically Extracting Complex Data Structures from the Web

نویسندگان

چکیده

منابع مشابه

ایجاد نیمه خودکار مشاپ های سازمانی با استفاده از توصیفات معنایی

ارائه مدلی برای استخراج اطلاعات از مستندات متنی، مبتنی بر متن‌کاوی در حوزه یادگیری الکترونیکی

Recognition of logical units in log files

Extracting Data Records from Query Result Pages Based on Visual Features

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

عنوان ژورنال:

اشتراک گذاری